-
Notifications
You must be signed in to change notification settings - Fork 31
fix array OOBE in blocked bloom filter when top 4 bits of hash are se… #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix array OOBE in blocked bloom filter when top 4 bits of hash are se… #14
Conversation
…t (seed dependent behaviour)
|
In the blocked bloom filters was it intended that the same key would never be mapped to the same word (they now do whenever the upper 4 bits of the hash are zero), hence the plus 1? Should the solution be to increase (double?) the size of the bitset instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments... Unfortunately I don't have much time currently to work on this, but I will give you access rights so you can commit yourself once the issues are addressed.
fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java
Outdated
Show resolved
Hide resolved
| public class BlockedBloom implements Filter { | ||
|
|
||
| public static BlockedBloom construct(long[] keys, int bitsPerKey) { | ||
| long n = keys.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, using long doesn't make sense here... I used long in some places to avoid integer overflow, but here it's not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cases where you can reason that overflow is impossible, accumulating into a long can cause significant degradation in throughput once the loop has been compiled by C2. I can share some benchmarks which demonstrate this. Of course, sometimes it's the only safe thing to do...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I wasn't aware of this... I will keep this in mind!
| } | ||
| int si = 0; | ||
| while (si < 2 * keys.length) { | ||
| while (si < 2 * keys.length && qi > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The XorSimple is (on purpose) the most simple possible implementation. It is only for educational purposes. It shouldn't have any performance improvements. But of course it shouldn't have any bugs either... So not sure if this is really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes a bug where qi goes negative in the loop, rather than fail to map. This occurred for the first time with the seed in the regression tests, but cannot be reproduced because of the unseeded calls to new Random().nextLong() (which #13 aims to fix). Basically there do exist cases where this loop will access Q[-1] rather than fail to map and try again, but it's nondeterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest you omit this change from your PR and, instead, report it as a bug. It seems likely to me that your fix (qi > 0) is just patching over a logical error of some kind. It would be better to identify the logical error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will do that.
|
@thomasmueller I sensed this wasn't the correct fix for the blocked bloom filters so will address that. |
…simple fuzzer back, ignoring MPHF and GCS2
fcfaab9 to
71f78bc
Compare
|
@lemire do you have any time to look at this? |
|
@richardstartin See my proposal. I think that only one change you are making is somewhat controversial in that it might be hiding a real bug instead of exposing it. I recommend you omit this change and then I think that the whole PR could be safely merged without any controversy. |
|
I would love to give you more feedback, but I'm afraid I have very little time now... Next year should be better. |
…CS2, COUNTING_BLOOM
|
I will leave the branch like this - there are several false negative bugs which aren't trivial to fix, and I've reverted the change to |
|
I recommend merging this PR. Do you agree? |
|
I think it would break the build, see https://travis-ci.com/FastFilter/fastfilter_java/builds/142464328 What about disabling the tests for now, so the build passes, and create issues for the problems? That way I think merging is fine. |
|
You shouldn't merge the PR, I've left it to break to highlight the existence of various bugs which need fixing or otherwise removing. |
|
Ok. My view was that it does not really break anything, it merely exposes existing problems and thus is safe to merge. But I see that Richard has opened distinct bug issues. |
|
@lemire I don't think there is a rush to merge it. I was keen to help get the project to the point where it could be released and used, but there are several nondeterministic bugs I don't have enough context to be able to fix, so my goal may have been a little premature. The fixes which are made in this PR are trivial, and merging the branch will just break master. It's worth leaving around for now to verify that the harder fixes have been made. |
|
Ok. |
|
I merged the patch, as it improves the code a lot. I will try to resolve the remaining issues. One option is to move the buggy implementations (specially MPHF, GCS2) to the "test" source folder, until they are fixed (kind of like "staging"). |
|
@thomasmueller I think moving them to test makes sense, whilst there are completely new filter implementations, the availability of some filters for evaluation purposes only could be confusing for users. |
|
+1 |
…t (seed dependent behaviour)